Skip to content

chore: router ci improvement#2234

Merged
miklosbarabas merged 20 commits intomainfrom
miklos/eng-8201-chore-router-ci-improvements
Oct 3, 2025
Merged

chore: router ci improvement#2234
miklosbarabas merged 20 commits intomainfrom
miklos/eng-8201-chore-router-ci-improvements

Conversation

@miklosbarabas
Copy link
Copy Markdown
Contributor

@miklosbarabas miklosbarabas commented Sep 22, 2025

  • CI now performs Redis ACL setup and health checks inside the running container rather than on the host.
  • Host-level cluster setup and host-installed redis-tools are no longer used, reducing runner dependencies.
  • Standardized container-scoped port usage and dynamic container selection to improve CI stability and reduce flakiness.
  • No user-facing changes; improves build reliability and contributor experience.
  • Removed unnecessary unofficial GH Actions to reduce supply-chain attack surface

Summary by CodeRabbit

  • CI
    • Integration tests now provision a Redis cluster with health checks, automatic readiness waiting, dynamic ACL/config application, and simplified test invocation.
  • Tests
    • Added configurable retries for failed tests.
    • Flaky event-stream and batch tests moved into dedicated flaky suites.
    • A specific rate-limit cluster-mode test was converted to a non-executing placeholder pending follow-up.
  • Chores
    • Upgraded test-runner tooling to a newer version.

Checklist

- improved redis usage from the ci
- renamed TestNatsEvents to TestFlakyNatsEvents to be retried on failure
- commented out suspected faulty test
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Sep 22, 2025

Walkthrough

CI adds three Redis cluster services, a readiness/initialization step, and dynamic ACL/config application. Several flaky tests are moved into new top-level test functions or commented out. gotestsum invocation and version updated, and test rerun support added to Makefiles.

Changes

Cohort / File(s) Summary of changes
CI workflow & Redis cluster orchestration
.github/workflows/router-ci.yaml
Adds public services redis-0, redis-1, redis-2 (image bitnamilegacy/redis-cluster:7.2) with env, healthchecks, ports; removes docker login/credentials under redis; adds “Wait for Redis Cluster” readiness/init step; replaces fixed-port ACL loop with container discovery and ACL application; removes prior dedicated cluster setup step; simplifies integration test invocation.
NATS flaky test extraction
router-tests/events/nats_events_test.go
Removes the embedded “subscribe ws with filter” block from TestNatsEvents and adds a new top-level func TestFlakyNatsEvents(t *testing.T) containing that logic.
Batch flaky test extraction
router-tests/batch_test.go
Moves/duplicates the inlined Batch Tracing assertions into a new top-level func TestFlakyBatch(t *testing.T) and removes the inlined nested test.
Rate limit test change
router-tests/ratelimit_test.go
Removes a specific Cluster Mode test case from the active matrix and leaves it commented with a TODO (test converted to a non-executed placeholder).
Test tooling / Makefile updates
router-tests/Makefile, Makefile
Adds test_retry_count ?= 0; updates gotestsum invocation in router-tests/Makefile to include --rerun-fails="$(test_retry_count)", --packages="$(test_target)", and -f $(FORMAT); bumps gotest.tools/gotestsum version from v1.12.3 to v1.13.0 in Makefile setup target.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly identifies the PR as non-user-facing (“chore”) and clearly indicates that the primary change is improving the router CI pipeline, matching the overall objectives described in the PR without extraneous detail.
✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@miklosbarabas miklosbarabas mentioned this pull request Sep 22, 2025
5 tasks
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Sep 22, 2025

Router image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-7101448bdbf338d4891559945882db7d3ca43776

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/router-ci.yaml (1)

321-331: Regex excludes many legitimate tests; tighten to “not starting with TestFlaky”.

^Test[^(Flaky)] will skip any test whose first char after “Test” is F/l/a/k/y/“(”. Replace with an alternation that matches everything except TestFlaky….

-        run: make test test_params="-run '^Test[^(Flaky)]' --timeout=5m -p 1 --parallel 10" test_target="${{ matrix.test_target }}"
+        run: make test test_params="-run '^(Test[^F].*|TestF[^l].*|TestFl[^a].*|TestFla[^k].*|TestFlak[^y].*)$' --timeout=5m -p 1 --parallel 10" test_target="${{ matrix.test_target }}"

If you’d prefer simpler ops, an alternative is to keep -run '^Test' here and ensure all flaky suites are skipped via t.Skip behind an env flag (e.g., SKIP_FLAKY=1) and run them explicitly in the flaky step with that flag unset.

🧹 Nitpick comments (3)
router-tests/ratelimit_test.go (1)

703-707: Don’t comment out tests; skip with rationale and TODO.

Convert the commented case into a t.Run that calls t.Skipf(...) explaining the cluster default-user password caveat. Keeps the test discoverable and documents intent.

-				// Seems like not working as expected ? (if password is configured for the cluster default user)
-				//{
-				//	name:            "should successfully use auth from later url if no auth in first urls",
-				//	clusterUrlSlice: []string{"redis://localhost:7003", "rediss://localhost:7001", "rediss://cosmo:test@localhost:7002"},
-				//},
+				{
+					name:            "should successfully use auth from later url if no auth in first urls (skipped)",
+					clusterUrlSlice: []string{"redis://localhost:7003", "rediss://localhost:7001", "rediss://cosmo:test@localhost:7002"},
+				},

And in the loop:

-			for _, tt := range tests {
+			for _, tt := range tests {
 				t.Run(tt.name, func(t *testing.T) {
 					t.Parallel()
+					if strings.Contains(tt.name, "(skipped)") {
+						t.Skipf("Disabled under CI when cluster default user requires auth; follow-up needed to assert client URL auth merge semantics independently of server auth policy.")
+					}
.github/workflows/router-ci.yaml (2)

192-233: Redis Cluster services: good coverage; consider pinning image digests.

The 3-node cluster setup looks solid. To reduce supply-chain drift in CI, pin images to immutable digests (e.g., bitnamilegacy/redis-cluster:7.2@sha256:...).


309-319: Inline credentials in CI script (CKV_SECRET_4).

Even though this is test infra, embedding cosmo:test in URIs triggers secret scanners and isn’t future‑proof. Use env vars in the step and reference them.

-      - name: Configure Redis Authentication & ACL
-        run: |
+      - name: Configure Redis Authentication & ACL
+        env:
+          REDIS_TEST_USER: cosmo
+          REDIS_TEST_PASSWORD: test
+        run: |
           docker ps -a
           # Set a password for each master node
-          for cid in $(docker ps --format "{{.ID}} {{.Image}}" | grep "redis-cluster" | awk '{print $1}'); do
+          for cid in $(docker ps --format "{{.ID}} {{.Image}}" | grep "redis-cluster" | awk '{print $1}'); do
             echo "Configuring ACLs in container $cid"
-            docker exec "$cid" redis-cli -p 6379 ACL SETUSER cosmo on ">test" "~*" "+@all"
+            docker exec "$cid" redis-cli -p 6379 ACL SETUSER "$REDIS_TEST_USER" on ">$REDIS_TEST_PASSWORD" "~*" "+@all"
             docker exec "$cid" redis-cli -p 6379 ping
           done
-          cid=$(docker ps --format "{{.ID}} {{.Image}}" | grep "redis:7" | awk '{print $1}')
+          cid=$(docker ps --format "{{.ID}} {{.Image}}" | grep "redis:7" | awk '{print $1}')
           # Sanity checks
           docker exec "$cid" redis-cli -p 6379 ping
-          docker exec "$cid" redis-cli -u "redis://cosmo:test@redis-0:6379" ping
-          docker exec "$cid" redis-cli -u "redis://cosmo:test@redis-0:6379" cluster nodes
+          docker exec "$cid" redis-cli -u "redis://${REDIS_TEST_USER}:${REDIS_TEST_PASSWORD}@redis-0:6379" ping
+          docker exec "$cid" redis-cli -u "redis://${REDIS_TEST_USER}:${REDIS_TEST_PASSWORD}@redis-0:6379" cluster nodes
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ae65619 and dac721e.

📒 Files selected for processing (3)
  • .github/workflows/router-ci.yaml (3 hunks)
  • router-tests/events/nats_events_test.go (1 hunks)
  • router-tests/ratelimit_test.go (1 hunks)
🧰 Additional context used
🪛 Checkov (3.2.334)
.github/workflows/router-ci.yaml

[medium] 317-318: Basic Auth Credentials

(CKV_SECRET_4)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./events)
  • GitHub Check: build_test
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
router-tests/events/nats_events_test.go (1)

65-66: Rename to TestFlaky… is fine; verify CI selection regex won’t over-exclude.

The rename cleanly gates this suite behind “flaky” runs. However, the CI regex used to exclude flakies (^Test[^(Flaky)]) will also skip any tests starting with TestF…, Testl…, Testa…, etc. Please update the workflow regex as suggested in router-ci.yaml to avoid unintentionally dropping non‑flaky tests.

Comment thread .github/workflows/router-ci.yaml
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
router-tests/Makefile (1)

4-4: Sanity‑check --rerun-fails behavior in your CI version of gotestsum.

Recent reports show --rerun-fails can sometimes yield a zero exit code despite failures, depending on version/output parsing. If you see anomalies, disable it or add --rerun-fails-report to surface retries. (github.com)

If helpful, I can add a CI‑guard to toggle reruns (e.g., RERUN_FLAKY?=1 and $(if $(RERUN_FLAKY),--rerun-fails=$(rerun_fails_count))).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dac721e and cfb3597.

📒 Files selected for processing (1)
  • router-tests/Makefile (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (go)

Comment thread router-tests/Makefile Outdated
Comment thread router-tests/Makefile Outdated
- using gotestsum to retry flaky tests
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/router-ci.yaml (1)

159-161: Runner label typo blocks the job from scheduling.

Use ubuntu-latest; ubuntu-latest-l is invalid.

-    runs-on: ubuntu-latest-l
+    runs-on: ubuntu-latest
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cfb3597 and 61f6f77.

📒 Files selected for processing (2)
  • .github/workflows/router-ci.yaml (3 hunks)
  • router-tests/Makefile (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router-tests/Makefile
🧰 Additional context used
🪛 Checkov (3.2.334)
.github/workflows/router-ci.yaml

[medium] 317-318: Basic Auth Credentials

(CKV_SECRET_4)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_test
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (2)
.github/workflows/router-ci.yaml (2)

256-294: Harden the “Wait for Redis Cluster” probe (strict mode, fail-fast, pick the creator by intent).

Same as the earlier suggestion; current script may silently pass/fail and pick an arbitrary node.

-      - name: Wait for Redis Cluster
-        run: |
-          echo "[CHECK] Waiting for Redis Cluster to become healthy..."
-          cluster_containers=$(docker ps --quiet --filter "ancestor=bitnamilegacy/redis-cluster:7.2")
-
-          success=0
-          for i in {1..30}; do
-            if [ $i -eq 1 ]; then
-              echo "[INIT] Forcing cluster creation..."
-              # pick one container as the "creator"
-              creator=$(echo $cluster_containers | awk '{print $1}')
-              # run the cluster create command inside it
-              docker exec "$creator" redis-cli --cluster create redis-0:6379 redis-1:6379 redis-2:6379 --cluster-replicas 0 --cluster-yes || true
-            fi
-
-            for cid in $cluster_containers; do
-              docker exec "$cid" redis-cli -p 6379 cluster info
-              if docker exec "$cid" redis-cli -p 6379 cluster info 2>/dev/null | grep -q "cluster_state:ok"; then
-                echo "[SUCCESS] Redis Cluster is ready (reported by $cid)"
-                success=1
-                break 2
-              fi
-            done
-
-            echo "[WAITING] Cluster not ready yet (attempt $i)..."
-            sleep 2
-          done
-
-          if [ $success -eq 0 ]; then
-            echo "[ERROR] Redis Cluster did not become healthy in time"
-            for cid in $cluster_containers; do
-              echo "--- Cluster info for $cid ---"
-              docker exec "$cid" redis-cli -p 6379 cluster info || true
-              docker exec "$cid" redis-cli -p 6379 cluster nodes || true
-              echo "--- Logs for $cid ---"
-              docker logs "$cid" | tail -n 100
-            done
-            exit 1
-          fi
+      - name: Wait for Redis Cluster
+        run: |
+          set -euo pipefail
+          echo "[CHECK] Waiting for Redis Cluster to become healthy..."
+          cluster_containers=$(docker ps --format '{{.ID}} {{.Names}} {{.Image}}' | awk '$3 ~ /bitnamilegacy\/redis-cluster:7\.2/ {print $1 "|" $2}')
+          if [ -z "${cluster_containers}" ]; then
+            echo "[ERROR] No redis-cluster containers found"; docker ps -a; exit 1
+          fi
+          # Prefer the intended creator (REDIS_CLUSTER_CREATOR=yes) via inspect; fallback to container named *redis-2*; else first
+          creator=""
+          while IFS= read -r pair; do
+            cid="${pair%%|*}"; name="${pair##*|}"
+            if docker inspect -f '{{range .Config.Env}}{{println .}}{{end}}' "$cid" | grep -q '^REDIS_CLUSTER_CREATOR=yes$'; then creator="$cid"; fi
+            [ -n "${first:-}" ] || first="$cid"
+            if [ -z "$creator" ] && echo "$name" | grep -q 'redis-2'; then creator="$cid"; fi
+          done <<< "$cluster_containers"
+          creator="${creator:-$first}"
+          echo "[INIT] Forcing cluster creation from $creator ..."
+          docker exec "$creator" redis-cli --cluster create redis-0:6379 redis-1:6379 redis-2:6379 --cluster-replicas 0 --cluster-yes || true
+
+          for i in {1..30}; do
+            for pair in $cluster_containers; do
+              cid="${pair%%|*}"
+              if docker exec "$cid" redis-cli -p 6379 cluster info 2>/dev/null | grep -q "cluster_state:ok"; then
+                echo "[SUCCESS] Redis Cluster is ready (reported by $cid)"; exit 0
+              fi
+            done
+            echo "[WAITING] Cluster not ready yet (attempt $i)..."; sleep 2
+          done
+          echo "[ERROR] Redis Cluster did not become healthy in time"
+          for pair in $cluster_containers; do
+            cid="${pair%%|*}"; echo "--- Cluster info for $cid ---"
+            docker exec "$cid" redis-cli -p 6379 cluster info || true
+            docker exec "$cid" redis-cli -p 6379 cluster nodes || true
+            echo "--- Logs for $cid ---"; docker logs "$cid" | tail -n 200 || true
+          done
+          exit 1

192-233: Keep port publishing — tests use localhost:7001/7002/7003

Search confirms tests reference the exposed host ports; do not remove port publishing.

Locations:

  • router-tests/testenv/testenv.go:399 — redisClusterHost = "redis://localhost:7001"
  • router-tests/ratelimit_test.go:686, 701, 710, 755 — clusterUrlSlice references localhost:7001/7002/7003
  • router-tests/automatic_persisted_queries_test.go:414 — clusterUrls include localhost:7001/7002

Comment thread .github/workflows/router-ci.yaml
- using gotestsum to retry flaky tests
- using gotestsum to retry flaky tests
- removed non-official custom gh actions to reduce supply-chain attack vector
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
.github/workflows/router-ci.yaml (1)

192-233: Redis cluster services: good direction; pin images and prefer deterministic creator selection.

  • Pin images by digest for determinism/supply-chain.
  • Minor: since only redis-2 is the creator, add a comment to keep this invariant or derive creator in the wait script (preferred).
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c635a4 and d105701.

📒 Files selected for processing (1)
  • .github/workflows/router-ci.yaml (3 hunks)
🧰 Additional context used
🪛 Checkov (3.2.334)
.github/workflows/router-ci.yaml

[medium] 317-318: Basic Auth Credentials

(CKV_SECRET_4)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: build_test
  • GitHub Check: Analyze (go)
🔇 Additional comments (3)
.github/workflows/router-ci.yaml (3)

256-294: Harden the wait probe; fail-fast, pick the intended creator, keep diagnostics.

Same concern as the prior review: add strict shell, verify containers found, prefer the creator node, and preserve logs on failure.

-      - name: Wait for Redis Cluster
-        run: |
-          echo "[CHECK] Waiting for Redis Cluster to become healthy..."
-          cluster_containers=$(docker ps --quiet --filter "ancestor=bitnamilegacy/redis-cluster:7.2")
-
-          success=0
-          for i in {1..30}; do
-            if [ $i -eq 1 ]; then
-              echo "[INIT] Forcing cluster creation..."
-              # pick one container as the "creator"
-              creator=$(echo $cluster_containers | awk '{print $1}')
-              # run the cluster create command inside it
-              docker exec "$creator" redis-cli --cluster create redis-0:6379 redis-1:6379 redis-2:6379 --cluster-replicas 0 --cluster-yes || true
-            fi
-
-            for cid in $cluster_containers; do
-              docker exec "$cid" redis-cli -p 6379 cluster info
-              if docker exec "$cid" redis-cli -p 6379 cluster info 2>/dev/null | grep -q "cluster_state:ok"; then
-                echo "[SUCCESS] Redis Cluster is ready (reported by $cid)"
-                success=1
-                break 2
-              fi
-            done
-
-            echo "[WAITING] Cluster not ready yet (attempt $i)..."
-            sleep 2
-          done
-
-          if [ $success -eq 0 ]; then
-            echo "[ERROR] Redis Cluster did not become healthy in time"
-            for cid in $cluster_containers; do
-              echo "--- Cluster info for $cid ---"
-              docker exec "$cid" redis-cli -p 6379 cluster info || true
-              docker exec "$cid" redis-cli -p 6379 cluster nodes || true
-              echo "--- Logs for $cid ---"
-              docker logs "$cid" | tail -n 100
-            done
-            exit 1
-          fi
+      - name: Wait for Redis Cluster
+        run: |
+          set -euo pipefail
+          echo "[CHECK] Waiting for Redis Cluster to become healthy..."
+          cluster_containers=$(docker ps --format '{{.ID}} {{.Names}} {{.Image}}' | awk '$3 ~ /bitnamilegacy\/redis-cluster:7\.2/ {print $1 "|" $2}')
+          if [ -z "${cluster_containers}" ]; then
+            echo "[ERROR] No redis-cluster containers found"; docker ps -a; exit 1
+          fi
+          creator=""
+          first=""
+          while IFS= read -r line; do
+            cid="${line%%|*}"; name="${line##*|}"
+            [ -n "${first}" ] || first="$cid"
+            if echo "$name" | grep -q 'redis-2'; then creator="$cid"; fi
+          done <<< "$cluster_containers"
+          creator="${creator:-$first}"
+          echo "[INIT] Forcing cluster creation from $creator ..."
+          timeout 5s docker exec "$creator" redis-cli --cluster create redis-0:6379 redis-1:6379 redis-2:6379 --cluster-replicas 0 --cluster-yes || true
+
+          for i in {1..30}; do
+            for pair in $cluster_containers; do
+              cid="${pair%%|*}"
+              if timeout 3s docker exec "$cid" redis-cli -p 6379 cluster info 2>/dev/null | grep -q "cluster_state:ok"; then
+                echo "[SUCCESS] Redis Cluster is ready (reported by $cid)"; exit 0
+              fi
+            done
+            echo "[WAITING] Cluster not ready yet (attempt $i)..."; sleep 2
+          done
+          echo "[ERROR] Redis Cluster did not become healthy in time"
+          for pair in $cluster_containers; do
+            cid="${pair%%|*}"; echo "--- Cluster info for $cid ---"
+            docker exec "$cid" redis-cli -p 6379 cluster info || true
+            docker exec "$cid" redis-cli -p 6379 cluster nodes || true
+            echo "--- Logs for $cid ---"; docker logs "$cid" | tail -n 200 || true
+          done
+          exit 1

309-318: Don’t embed credentials; use env/secrets and redis-cli --user/-a. Fix CKV_SECRET_4.

This leaks creds into logs and triggers Checkov. Use masked env and avoid URIs.

-          # Set a password for each master node
-          for cid in $(docker ps --format "{{.ID}} {{.Image}}" | grep "redis-cluster" | awk '{print $1}'); do
-            echo "Configuring ACLs in container $cid"
-            docker exec "$cid" redis-cli -p 6379 ACL SETUSER cosmo on ">test" "~*" "+@all"
-            docker exec "$cid" redis-cli -p 6379 ping
-          done
-          cid=$(docker ps --format "{{.ID}} {{.Image}}" | grep "redis:7" | awk '{print $1}')
-          # Sanity checks
-          docker exec "$cid" redis-cli -p 6379 ping
-          docker exec "$cid" redis-cli -u "redis://cosmo:test@redis-0:6379" ping
-          docker exec "$cid" redis-cli -u "redis://cosmo:test@redis-0:6379" cluster nodes
+          # Configure ACLs (secrets masked by Actions)
+          ACL_USER="${ACL_USER:-cosmo}"
+          ACL_PASS="${CI_REDIS_PASS:-test}"
+          echo "::add-mask::$ACL_PASS"
+          for cid in $(docker ps --format '{{.ID}} {{.Image}}' | awk '$2 ~ /redis-cluster/ {print $1}'); do
+            echo "Configuring ACLs in container $cid"
+            docker exec "$cid" redis-cli -p 6379 ACL SETUSER "$ACL_USER" on ">$ACL_PASS" "~*" "+@all"
+            docker exec "$cid" redis-cli -p 6379 ping >/dev/null
+          done
+          cid=$(docker ps --format '{{.ID}} {{.Image}}' | awk '$2=="redis:7"{print $1}' | head -n1)
+          # Sanity checks (no URI with creds in logs)
+          docker exec "$cid" redis-cli -h redis-0 -p 6379 --user "$ACL_USER" -a "$ACL_PASS" ping
+          docker exec "$cid" redis-cli -h redis-0 -p 6379 --user "$ACL_USER" -a "$ACL_PASS" cluster nodes

Optionally add to the step:
env:
CI_REDIS_PASS: ${{ secrets.CI_REDIS_PASS }}
ACL_USER: cosmo


321-321: Retries confirmed — no change needed

router-tests/Makefile line 15 invokes gotestsum with --rerun-fails="$(test_retry_count)", so the workflow's test invocation will trigger retries.

Comment thread .github/workflows/router-ci.yaml
Comment thread .github/workflows/router-ci.yaml
Comment thread router-tests/Makefile
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
.github/workflows/router-ci.yaml (1)

192-233: Redis Cluster services: consider avoiding host port exposure and pin image by digest.

  • You can rely on service DNS (redis-0/1/2:6379) from the job container and drop host port mappings (7001-7003) to reduce collision risk.
  • Pin bitnamilegacy/redis-cluster:7.2 by digest to harden supply chain and avoid accidental tag drift.
  • If DockerHub rate limiting is a concern, consider mirroring to GHCR or using an org-managed cache.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d105701 and 13abcb8.

📒 Files selected for processing (1)
  • .github/workflows/router-ci.yaml (3 hunks)
🧰 Additional context used
🪛 Checkov (3.2.334)
.github/workflows/router-ci.yaml

[medium] 317-318: Basic Auth Credentials

(CKV_SECRET_4)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: build_test
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: build_push_image
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (5)
.github/workflows/router-ci.yaml (5)

256-294: Harden the Redis Cluster wait/probe; fail fast and prefer the creator node.

Add strict shell options, error out when no containers found, and prefer the intended creator (redis-2). Keep diagnostics on failure.

-      - name: Wait for Redis Cluster
-        run: |
-          echo "[CHECK] Waiting for Redis Cluster to become healthy..."
-          cluster_containers=$(docker ps --quiet --filter "ancestor=bitnamilegacy/redis-cluster:7.2")
-
-          success=0
-          for i in {1..30}; do
-            if [ $i -eq 1 ]; then
-              echo "[INIT] Forcing cluster creation..."
-              # pick one container as the "creator"
-              creator=$(echo $cluster_containers | awk '{print $1}')
-              # run the cluster create command inside it
-              docker exec "$creator" redis-cli --cluster create redis-0:6379 redis-1:6379 redis-2:6379 --cluster-replicas 0 --cluster-yes || true
-            fi
-
-            for cid in $cluster_containers; do
-              docker exec "$cid" redis-cli -p 6379 cluster info
-              if docker exec "$cid" redis-cli -p 6379 cluster info 2>/dev/null | grep -q "cluster_state:ok"; then
-                echo "[SUCCESS] Redis Cluster is ready (reported by $cid)"
-                success=1
-                break 2
-              fi
-            done
-
-            echo "[WAITING] Cluster not ready yet (attempt $i)..."
-            sleep 2
-          done
-
-          if [ $success -eq 0 ]; then
-            echo "[ERROR] Redis Cluster did not become healthy in time"
-            for cid in $cluster_containers; do
-              echo "--- Cluster info for $cid ---"
-              docker exec "$cid" redis-cli -p 6379 cluster info || true
-              docker exec "$cid" redis-cli -p 6379 cluster nodes || true
-              echo "--- Logs for $cid ---"
-              docker logs "$cid" | tail -n 100
-            done
-            exit 1
-          fi
+      - name: Wait for Redis Cluster
+        run: |
+          set -euo pipefail
+          echo "[CHECK] Waiting for Redis Cluster to become healthy..."
+          cluster_containers=$(docker ps --format '{{.ID}} {{.Names}} {{.Image}}' | awk '$3 ~ /bitnamilegacy\/redis-cluster:7\.2/ {print $1 "|" $2}')
+          if [ -z "${cluster_containers}" ]; then
+            echo "[ERROR] No redis-cluster containers found"; docker ps -a; exit 1
+          fi
+          # Prefer the creator node if present; otherwise pick the first
+          creator=""
+          while IFS= read -r line; do
+            cid="${line%%|*}"; name="${line##*|}"
+            if echo "$name" | grep -q "redis-2"; then creator="$cid"; fi
+            [ -n "${first:-}" ] || first="$cid"
+          done <<< "$cluster_containers"
+          creator="${creator:-$first}"
+          echo "[INIT] Forcing cluster creation from $creator ..."
+          docker exec "$creator" redis-cli --cluster create redis-0:6379 redis-1:6379 redis-2:6379 --cluster-replicas 0 --cluster-yes || true
+
+          for i in {1..30}; do
+            for pair in $cluster_containers; do
+              cid="${pair%%|*}"
+              if docker exec "$cid" redis-cli -p 6379 cluster info 2>/dev/null | grep -q "cluster_state:ok"; then
+                echo "[SUCCESS] Redis Cluster is ready (reported by $cid)"; exit 0
+              fi
+            done
+            echo "[WAITING] Cluster not ready yet (attempt $i)..."; sleep 2
+          done
+          echo "[ERROR] Redis Cluster did not become healthy in time"
+          for pair in $cluster_containers; do
+            cid="${pair%%|*}"; echo "--- Cluster info for $cid ---"
+            docker exec "$cid" redis-cli -p 6379 cluster info || true
+            docker exec "$cid" redis-cli -p 6379 cluster nodes || true
+            echo "--- Logs for $cid ---"; docker logs "$cid" | tail -n 200 || true
+          done
+          exit 1

306-318: Don’t embed credentials; pass via env and use redis-cli --user/-a.

Avoid leaking creds into logs and satisfy CKV_SECRET_4.

       # Set a password for each master node
-      for cid in $(docker ps --format "{{.ID}} {{.Image}}" | grep "redis-cluster" | awk '{print $1}'); do
-        echo "Configuring ACLs in container $cid"
-        docker exec "$cid" redis-cli -p 6379 ACL SETUSER cosmo on ">test" "~*" "+@all"
-        docker exec "$cid" redis-cli -p 6379 ping
+      ACL_USER="${ACL_USER:-cosmo}"
+      ACL_PASS="${ACL_PASS:-test}"
+      export ACL_USER ACL_PASS
+      set +x
+      for cid in $(docker ps --format '{{.ID}} {{.Image}}' | awk '$2=="bitnamilegacy/redis-cluster:7.2"{print $1}'); do
+        echo "Configuring ACLs in container $cid"
+        docker exec "$cid" redis-cli -p 6379 ACL SETUSER "$ACL_USER" on ">$ACL_PASS" "~*" "+@all"
+        docker exec "$cid" redis-cli -p 6379 --user "$ACL_USER" -a "$ACL_PASS" ping
       done
-      cid=$(docker ps --format "{{.ID}} {{.Image}}" | grep "redis:7" | awk '{print $1}')
+      cid=$(docker ps --format '{{.ID}} {{.Image}}' | awk '$2=="redis:7"{print $1}' | head -n1)
       # Sanity checks
-      docker exec "$cid" redis-cli -p 6379 ping
-      docker exec "$cid" redis-cli -u "redis://cosmo:test@redis-0:6379" ping
-      docker exec "$cid" redis-cli -u "redis://cosmo:test@redis-0:6379" cluster nodes
+      docker exec "$cid" redis-cli -h redis-0 -p 6379 --user "$ACL_USER" -a "$ACL_PASS" ping
+      docker exec "$cid" redis-cli -h redis-0 -p 6379 --user "$ACL_USER" -a "$ACL_PASS" cluster nodes

Note: if you can use repo secrets, prefer setting ACL_USER/ACL_PASS via secrets and omit the default.


192-233: Claim vs implementation: host ports mapped contradict “container-scoped” usage.

PR summary says “container-scoped port usage,” but 7001-7003 are host-published here.

Can tests resolve redis-0/1/2 by service DNS and drop the port publishes?


159-159: Verify runner label “ubuntu-latest-l”.

Is this a custom self-hosted label? If not, GitHub’s standard label is “ubuntu-latest”.


321-321: LGTM on test invocation.

The tighter timeout and explicit parallelism look fine.

@miklosbarabas miklosbarabas requested a review from endigma October 2, 2025 22:24
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
router-tests/ratelimit_test.go (1)

703-707: Consider opening a tracking issue for the commented-out test.

The TODO note explains that the test was a false positive and requires investigation. To ensure this task is not forgotten, consider opening a tracking issue (if one doesn't already exist) to investigate and fix the underlying functionality.

Do you want me to open a new issue to track this task?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53e7e88 and e27cc5c.

📒 Files selected for processing (1)
  • router-tests/ratelimit_test.go (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: build-router
  • GitHub Check: Analyze (go)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_push_image
  • GitHub Check: image_scan
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./events)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: build_test

@miklosbarabas miklosbarabas enabled auto-merge (squash) October 3, 2025 06:15
Comment thread router-tests/events/nats_events_test.go Outdated
Copy link
Copy Markdown
Member

@endigma endigma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread router-tests/batch_test.go Outdated
Comment thread router-tests/events/nats_events_test.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
router-tests/events/nats_events_test.go (2)

1614-1724: Consider documenting the flakiness root cause.

The test implementation looks correct and comprehensive. However, isolating flaky tests without addressing the root cause can lead to accumulating technical debt, as noted in past review comments.

Consider adding a comment explaining:

  1. Why this test is flaky (e.g., timing-sensitive SSE event delivery, NATS message ordering)
  2. Whether the flakiness is environmental or indicates a potential product issue
  3. A reference to track investigation of the root cause (e.g., issue number)

This documentation will help future contributors understand whether the flakiness is acceptable or needs fixing.

Example:

 func TestFlakyNatsEvents(t *testing.T) {
 	t.Parallel()
+	// TODO(ENG-XXXX): This test is flaky due to timing sensitivity in SSE event delivery
+	// when multiple events are published rapidly. The test occasionally misses events
+	// or receives them out of order. This is likely a test infrastructure issue rather
+	// than a product bug, but should be investigated.
 
 	t.Run("subscribe sse with filter", func(t *testing.T) {

1614-1616: Establish pattern for additional flaky tests.

Good separation of flaky tests into a dedicated top-level function. If additional tests need to be marked as flaky in the future, consider adding them as sub-tests within TestFlakyNatsEvents rather than creating multiple TestFlaky* functions. This keeps all flaky tests in one place and makes CI configuration simpler.

Based on past review comments, which mentioned potential flakiness in Kafka and Redis tests.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e27cc5c and 6ba2c3c.

📒 Files selected for processing (2)
  • router-tests/events/nats_events_test.go (1 hunks)
  • router-tests/ratelimit_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router-tests/ratelimit_test.go
🧰 Additional context used
🧬 Code graph analysis (1)
router-tests/events/nats_events_test.go (2)
router-tests/testenv/testenv.go (4)
  • Run (107-124)
  • Config (286-343)
  • ConfigWithEdfsNatsJSONTemplate (87-87)
  • Environment (1731-1767)
router-tests/testenv/utils.go (1)
  • AwaitChannelWithT (10-19)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_test
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (1)
router-tests/events/nats_events_test.go (1)

1614-1724: Confirm removal of "subscribe sse with filter" from TestNatsEvents; only one occurrence remains in TestFlakyNatsEvents.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
router-tests/batch_test.go (1)

750-803: Clarify test duplication and flaky subset.

This new TestFlakyBatch function appears to duplicate logic from the existing "Verify all root span attributes for batch requests" test in TestBatch (lines 565-631), but with significantly reduced attribute coverage (7 attributes vs 24).

Given the past review discussion about isolating specific flaky tests, please clarify:

  1. Is this the minimal flaky subset, or are you still determining which specific assertions are flaky?
  2. Why were these particular 7 attributes selected for the flaky test?
  3. Should the original comprehensive test in TestBatch remain, or will it be removed once flakiness is resolved?

The reduced coverage could mask issues with the untested attributes, so understanding the rationale will help ensure test quality.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 34f2aa8 and 4cfadd8.

📒 Files selected for processing (1)
  • router-tests/batch_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
router-tests/batch_test.go (4)
router-tests/testenv/testenv.go (4)
  • Run (107-124)
  • Config (286-343)
  • Environment (1731-1767)
  • GraphQLRequest (1907-1915)
router/pkg/trace/tracetest/tracetest.go (1)
  • NewInMemoryExporter (11-17)
router/pkg/config/config.go (2)
  • Config (998-1072)
  • BatchingConfig (879-884)
router/pkg/otel/attributes.go (7)
  • WgRouterConfigVersion (21-21)
  • WgRouterRootSpan (32-32)
  • WgIsBatchingOperation (49-49)
  • WgBatchingOperationsCount (50-50)
  • WgOperationHash (14-14)
  • WgClientName (18-18)
  • WgClientVersion (19-19)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: image_scan
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_test
  • GitHub Check: integration_test (./events)
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (1)
router-tests/batch_test.go (1)

791-791: Dynamic config version check differs from original test.

Line 791 uses xEnv.RouterConfigVersionMain() for dynamic version checking, while the original test at line 610 uses sa.HasValue(otel.WgRouterConfigVersion) to check for attribute existence.

This difference could explain flakiness if the config version value changes across test runs. However, the original comprehensive test also passes, suggesting this might not be the root cause.

Consider verifying whether the config version value is stable in your CI environment.

@miklosbarabas miklosbarabas merged commit fea4d1d into main Oct 3, 2025
29 checks passed
@miklosbarabas miklosbarabas deleted the miklos/eng-8201-chore-router-ci-improvements branch October 3, 2025 12:21
@coderabbitai coderabbitai Bot mentioned this pull request Nov 25, 2025
5 tasks
@coderabbitai coderabbitai Bot mentioned this pull request Jan 14, 2026
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants